\(~\)
Some of this material was (in modified form) created by Mette Langaas, who put a lot of effort into creating the original version of this module. Thanks to Mette for the permission to use the material!
Some of the figures and slides in this presentation are taken (or are inspired) from James et al. (2021).
James et al. (2021): An Introduction to Statistical Learning, Chapter 10.
All the material presented on these module slides and in class.
Videos on neural networks and backpropagation
\(~\)
Secondary material (not compulsory):
\(~\)
See also References and further reading (last slide), for further reading material.
\(~\)
\(~\)
1950s: First neural networks (NN) in “toy form”.
1980s: the backpropagation algorithm was rediscovered.
1989: Yann LeCun (Bell Labs) used convolutional neural networks to classify handwritten digits.
2000s: After the first hype, NNs were pushed aside by boosting and support vector machines.
Since 2010: Revival! The emergence of deep learning as a consequence of improved computing resources, some innovations, and applications to image and video classification, and speech and text processing.
\(~\)
\(~\)
Neuron and myelinated axon, with signal flow from inputs at dendrites to outputs at axon terminals.
Image credits: By Egm4313.s12 (Prof. Loc Vu-Quoc) https://commons.wikimedia.org/w/index.php?curid=72816083
\(~\)
\(~\)
\(~\)
\(~\)
\(~\)
\(~\)
\(~\)
According to Chollet and Allaire (2018) (page 19):
Machine learning isn’t mathematics or physics, […] it’s an engineering science.
\(~\)
\(~\)
\(~\)
\(~\)
The success of deep learning is dependent upon the breakthroughs in
\(~\)
Achievements of deep learning include
\(~\)
Alternative: networks with more than one hidden layer. A network with many hidden layers is called a deep network.
(Fig 10.4 James et al. (2021))
\(~\)
\(~\)
\(~\)
\(~\)
\(~\)
\(~\)
\(~\)
Objective: classify the digit contained in an image (28 \(\times\) 28 greyscale).
\(~\)
We now focus on the different elements of neural networks.
\(~\)
\(~\)
These choices have been guided by solutions in statistics (multiple linear regression, logistic regression, multiclass regression)
\(~\)
Identity: for continuous outcome (regression problems) \[f(X)=X \ .\]
Sigmoid: for binary outcome (two-class classification problems) \[f(X)=\text{Pr}(Y=1 | X ) = \frac{1}{1+\exp(-X)} = \frac{\exp(X)}{1+\exp(X)} \ .\]
Softmax: for multinomial/categorical outcome (multi-class classification problems) \[ f_m(X) = \text{Pr}(Y=m | X ) = \frac{\exp(Z_m)}{\sum_{s=1}^{C}\exp(Z_s)} \ . \]
Note that we denote by \(Z_m\) the value in the output node \(m\) before the output layer activation.
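As a quick numerical illustration (a sketch of our own, not part of the lecture code), the sigmoid and softmax activations can be written directly in R:

```r
# Sigmoid and softmax output activations (illustrative sketch)
sigmoid <- function(z) 1 / (1 + exp(-z))
softmax <- function(z) exp(z) / sum(exp(z))  # z holds Z_1, ..., Z_C

sigmoid(0)                  # 0.5, i.e. Pr(Y = 1 | X) when the linear predictor is 0
p <- softmax(c(2, 1, 0.5))  # C = 3 class probabilities
sum(p)                      # 1
```

In practice one subtracts \(\max(z)\) before exponentiating in the softmax for numerical stability; the toy version above keeps the formula as stated.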
\(~\)
\(~\)
\(~\)
The universal approximation theorem says that a feedforward network with
can approximate any (Borel measurable) function from one finite-dimensional space (our input layer) to another (our output layer) with any desired non-zero amount of error.
\(~\)
Network architecture contains three components:
\(~\)
Width: How many nodes are in each layer of the network?
Depth: How deep is the network (how many hidden layers)?
Connectivity: How are the nodes connected to each other?
\(~\)
The connectivity in particular depends on the problem, and experience is important here.
\(~\)
However, the recent practice is to
\(~\)
\(~\)
This reduces the choice of network architecture to choosing a sufficiently large network.
\(~\)
See e.g. Chollet and Allaire (2018), Section 4.5.6/7 and Goodfellow, Bengio, and Courville (2016), Section 7
\(~\)
\(~\)
| Problem | Output nodes | Output activation | Loss function |
|---|---|---|---|
| Regression | 1 | linear | mse |
| Classification (C=2) | 1 | sigmoid | binary_crossentropy |
| Classification (C>2) | C | softmax | categorical_crossentropy |
\[J({\boldsymbol \theta}) = \sum_{i=1}^n (y_i- f(x_i))^2\]
\(~\)
\[J({\boldsymbol \theta}) = -\sum_{i=1}^n \sum_{m=1}^C y_{im} \log f_m(x_i) \ ,\]
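A small numerical illustration (toy numbers of our own, not from the book): with one-hot coded responses, both loss functions are one-liners in R:

```r
# Toy data (made up): two observations, C = 3 classes, one-hot coded y
y <- rbind(c(1, 0, 0),
           c(0, 1, 0))
f <- rbind(c(0.7, 0.2, 0.1),   # predicted class probabilities f_m(x_i)
           c(0.1, 0.8, 0.1))
-sum(y * log(f))               # cross-entropy: -(log(0.7) + log(0.8))

# Squared error loss for a toy regression fit
y_reg <- c(24.0, 21.6)
f_reg <- c(23.5, 22.0)
sum((y_reg - f_reg)^2)         # 0.41
```

Only the predicted probability of the true class enters the cross-entropy, since the one-hot indicators zero out all other terms.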
\(~\)
Let the unknown parameters be denoted \({\boldsymbol \theta}\) (what we have previously denoted as \(\alpha\)s and \(\beta\)s), and the loss function to be minimized \(J({\boldsymbol \theta})\).
\(~\)
Gradient descent
Mini-batch stochastic gradient descent (SGD) and true SGD
Backpropagation
\(~\)
\(~\)
(https://github.com/SoojungHong/MachineLearning/wiki/Gradient-Descent)
\(~\)
\(~\)
Q: Why are we moving in the direction of the negative of the gradient? Why not the positive?
A: The gradient points in the direction of steepest ascent of \(J\); to minimize \(J\) we therefore step in the opposite direction.
\(~\)
Note that in full gradient descent, the loss function is computed as a mean over all training examples. \[ J({\boldsymbol \theta})=\frac{1}{n}\sum_{i=1}^n J({\boldsymbol x}_i, y_i) \ . \]
The gradient is an average over many individual gradients from the training examples. You can think of this as an estimator of an expectation.
\[ \nabla_{\boldsymbol \theta} J({\boldsymbol \theta})=\frac{1}{n}\sum_{i=1}^n \nabla_{\boldsymbol \theta} J({\boldsymbol x}_i, y_i) \ . \]
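A minimal gradient-descent loop in R (using our own toy loss \(J(\theta)=(\theta-3)^2\), chosen purely for illustration): stepping against the gradient drives \(J\) towards its minimum.

```r
# Gradient descent on J(theta) = (theta - 3)^2, with gradient 2 * (theta - 3)
grad <- function(theta) 2 * (theta - 3)
theta <- 0     # starting value
eta <- 0.1     # learning rate
for (iter in 1:200) theta <- theta - eta * grad(theta)
theta          # converges to the minimizer 3
```

Each iteration multiplies the distance to the minimizer by \(1 - 2\eta\), so with \(\eta = 0.1\) the error shrinks geometrically.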
\(~\)
Crucial idea:
The expectation can be approximated by the average gradient over just a (random) sample of the observations.
\(~\)
Advantages:
Mini-batch stochastic gradient descent
Special case: true SGD uses only one observation per iteration (mini-batch size 1). \(\rightarrow\) Mini-batch SGD is a compromise between true SGD (one sample per iteration) and full gradient descent (full dataset per iteration)
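A sketch of mini-batch SGD for simple linear regression (simulated data; batch size, learning rate, and iteration count are our own illustrative choices):

```r
# Mini-batch SGD for y = b0 + b1 * x + noise, with true (b0, b1) = (2, 3)
set.seed(1)
n <- 1000
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n, sd = 0.5)
b <- c(0, 0)        # initial (b0, b1)
eta <- 0.05         # learning rate
batch_size <- 32
for (iter in 1:2000) {
  idx <- sample(n, batch_size)                  # draw a random mini-batch
  res <- y[idx] - b[1] - b[2] * x[idx]
  g <- -2 * c(mean(res), mean(res * x[idx]))    # mini-batch gradient estimate
  b <- b - eta * g
}
b   # close to c(2, 3)
```

The estimates keep fluctuating around the optimum because each gradient is only a noisy estimate; decreasing the learning rate over time would dampen this.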
In the 3rd video (on backpropagation) from 3Blue1Brown there is a nice example of one trajectory from gradient descent and one from SGD (10:10 minutes into the video): https://www.youtube.com/watch?v=Ilg3gGewQ5U&list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi&index=3
\(~\)
More background:
\(~\)
\(~\)
\(~\)
\(~\)
\(~\)
\[ \tilde{J}({\boldsymbol w})= \frac{\alpha}{2}{{\boldsymbol w}^\top{\boldsymbol w}} + J({\boldsymbol w}) \ .\]
Based on Goodfellow, Bengio, and Courville (2016), Section 7.8
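The penalty \(\frac{\alpha}{2}{\boldsymbol w}^\top{\boldsymbol w}\) adds \(\alpha {\boldsymbol w}\) to the gradient, so each gradient step shrinks the weights towards zero. A tiny R sketch (with the data gradient set to zero, just to isolate the shrinkage effect):

```r
# Weight decay: the gradient of the penalty alpha/2 * w'w is alpha * w
alpha <- 0.1
eta <- 0.01
w <- c(1, -2)
grad_J <- c(0, 0)  # pretend data gradient (zero here, to show shrinkage only)
for (iter in 1:1000) w <- w - eta * (alpha * w + grad_J)
w   # each weight multiplied by (1 - eta * alpha)^1000, roughly 0.37
```

With a non-zero data gradient, the update balances fitting the data against keeping the weights small.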
\(~\)
Based on Goodfellow, Bengio, and Courville (2016), Section 7.12, and Chollet and Allaire (2018) 4.4.3
\(~\)
Dropout was developed by Geoff Hinton and his students.
\(~\)
During training: randomly drop out (set to zero) some of the outputs in a given layer at each iteration. Drop-out rates may be chosen between 0.2 and 0.5.
During test: no dropout, but scale down the layer output values by a factor equal to the drop-out rate (since more units are now active than during training).
Alternatively, the drop-out and scaling (now upscaling) can both be done during training.
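A sketch of the "scaling during training" variant (often called inverted dropout; the rate and layer values below are made up):

```r
# Inverted dropout applied to one layer's outputs during training
set.seed(42)
rate <- 0.3                              # drop-out rate
a <- runif(10)                           # some layer outputs (toy values)
keep <- rbinom(length(a), 1, 1 - rate)   # 0 = dropped, 1 = kept
a_train <- a * keep / (1 - rate)         # zero dropped units, upscale the rest
a_train
# At test time the layer is then used as-is: no dropout and no rescaling needed
```

The upscaling by \(1/(1-\text{rate})\) keeps the expected layer output the same during training and test.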
Fig 10.19, James et al. (2021)
\(~\)
There are many hyperparameters when building and fitting a neural network, like the network architecture, the number of batches to run before terminating the optimization, the drop-out rate, etc.
\(~\)
To avoid overfitting, we have some strategies:
Reduce network size.
Collect more observations.
Regularization.
\(~\)
It is important that the hyperparameters are chosen on a validation set or by cross-validation.
However, we may run into validation-set overfitting: when the validation set is used to tune many hyperparameters, we may effectively overfit the validation set.
\(~\)
We will use both the rather simple nnet R package by Brian Ripley and the currently very popular keras package for deep learning (the keras package will be presented later).
nnet fits one hidden layer with sigmoid activation function. The implementation is not gradient descent, but instead BFGS via optim.
Type ?nnet into your R console to see the arguments of nnet().
Objective: To predict the median price of owner-occupied homes in a given Boston suburb in the mid-1970s using 13 input variables.
This data set is available in both the MASS and keras R packages.
\(~\)
Read and check the data file:
library(MASS)
data(Boston)
dataset <- Boston
head(dataset)
## crim zn indus chas nox rm age dis rad tax ptratio black lstat
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21
## medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7
Preparation: Split into training and test data:
set.seed(123)
tt.train <- sort(sample(1:506, 404, replace = FALSE))
train_data <- dataset[tt.train, 1:13]
train_targets <- dataset[tt.train, 14]
test_data <- dataset[-tt.train, 1:13]
test_targets <- dataset[-tt.train, 14]
\(~\)
# Normalize with the training mean and sd (test data use the training statistics)
org_train = train_data
mean <- apply(train_data, 2, mean)
std <- apply(train_data, 2, sd)
train_data <- scale(train_data, center = mean, scale = std)
test_data <- scale(test_data, center = mean, scale = std)
\(~\)
Just checking out one hidden layer with 5 units to get going.
\(~\)
library(nnet)
fit5 <- nnet(train_targets ~ ., data = train_data, size = 5, linout = TRUE,
maxit = 1000, trace = F)
\(~\)
Calculate the MSE and the mean absolute error:
\(~\)
pred = predict(fit5, newdata = test_data, type = "raw")
mean((pred[, 1] - test_targets)^2)
## [1] 30.98705
mean(abs(pred[, 1] - test_targets))
## [1] 2.688098
library(NeuralNetTools)
plotnet(fit5)
keras
\(~\)
See recommended exercise.
\(~\)
\(~\)
(Example from the CIFAR-10 data set with only 10 classes).
\(~\)
\(~\)
\[\left[ \begin{matrix} a & b & c \\ d & e & f \\ g & h & i\\ j & k & l \\ \end{matrix} \right] \qquad \text{Convolved with } \qquad \left[ \begin{matrix} \alpha & \beta \\ \gamma & \delta \\ \end{matrix}\right] \] \(\rightarrow\) Convolved image:
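Writing out the entries (each one is the inner product of the filter with the corresponding \(2\times 2\) submatrix of the image, cf. the corresponding example in James et al. (2021)):

\[\left[ \begin{matrix} a\alpha + b\beta + d\gamma + e\delta & b\alpha + c\beta + e\gamma + f\delta \\ d\alpha + e\beta + g\gamma + h\delta & e\alpha + f\beta + h\gamma + i\delta \\ g\alpha + h\beta + j\gamma + k\delta & h\alpha + i\beta + k\gamma + l\delta \end{matrix} \right]\]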
\(~\)
\(~\)
The filter highlights regions in the image that are similar to the filter itself.
Filtering for vertical or horizontal stripes:
(Figure 10.7)
\(~\)
\(~\)
\(~\)
\(~\)
\(~\)
\(~\)
In a CNN, we now combine convolution and pooling steps iteratively:
The number of channels after a convolution step is the number of filters (\(K\)) that is used in this iteration.
The dimension of the 2D images after a pooling step is reduced, depending on the dimension of the filter (e.g., \(2\times 2\) reduces each dimension by a factor of 2).
In the end, all the dimensions are flattened (the 2D pixel arrays become one long vector of units).
The output layer has a softmax activation function since the aim is classification.
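A sketch of \(2\times 2\) max pooling in R (toy image values of our own):

```r
# Max pooling: each non-overlapping 2x2 block is replaced by its maximum,
# halving both image dimensions (4x4 -> 2x2)
img <- matrix(1:16, nrow = 4, byrow = TRUE)
pooled <- matrix(NA_real_, 2, 2)
for (i in 1:2) for (j in 1:2) {
  block <- img[(2 * i - 1):(2 * i), (2 * j - 1):(2 * j)]
  pooled[i, j] <- max(block)
}
pooled   # 2x2 matrix with entries 6, 8, 14, 16
```

Taking the maximum (rather than, say, the mean) keeps the strongest filter response in each region, which makes the representation somewhat invariant to small shifts.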
\(~\)
\(~\)
Figure 10.9 of James et al. (2021)
\(~\)
See
Section 10.3.5 in the book,
Examples in the recommended exercise 11.
\(~\)
Figure 10.12 of James et al. (2021)
Observed sequence \(X=\{ X_1, \ldots , X_L \}\), where each \(X_l^\top=(X_{l1},\ldots, X_{lp})\) is an input vector at point \(l\) in the sequence.
Sequence of hidden layers \(\{ A_1, \ldots, A_L \}\), where each \(A_l\) is a layer of \(K\) units \(A_l^\top = (A_{l1}, \ldots , A_{lK})\).
\(A_{lk}\) is determined as \[\begin{equation}\label{eq:chain} A_{lk} = g(w_{k0} + \sum_{j=1}^p w_{kj}X_{lj} + \sum_{s=1}^K u_{ks}A_{l-1,s}) \ , \end{equation}\]
with hidden layer activation function \(g()\) (e.g., ReLU).
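The recurrence can be traced in a few lines of R (all dimensions and weight values below are made-up toy choices):

```r
# Forward pass of the RNN hidden-state recurrence, with ReLU as g
relu <- function(z) pmax(z, 0)
set.seed(1)
L <- 4; p <- 3; K <- 2
X  <- matrix(rnorm(L * p), nrow = L)   # row l is the input X_l
W  <- matrix(0.1, K, p)                # input weights w_kj
w0 <- rep(0.05, K)                     # biases w_k0
U  <- matrix(0.2, K, K)                # recurrent weights u_ks
A  <- matrix(0, L, K)                  # hidden states A_1, ..., A_L
A_prev <- rep(0, K)                    # convention: A_0 = 0
for (l in 1:L) {
  A[l, ] <- relu(w0 + W %*% X[l, ] + U %*% A_prev)
  A_prev <- A[l, ]
}
A
```

Note that the same weights \(w_0\), \(W\), and \(U\) are reused at every position \(l\); this weight sharing is what distinguishes an RNN from a plain feedforward network.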
\(~\)
\(~\)
\(~\)
\(~\)
\(~\)
\(~\)
Why are the outputs \(O_1, \ldots, O_{L-1}\) there at all?
\(~\)
A:
They come for free (same weights \(\boldsymbol{B}\)).
Sometimes, the output is a whole sequence.
Trading statistics from the New York Stock Exchange:
Figure 10.14 of James et al. (2021)
\(~\)
Observations:
\(~\)
Every day (\(t=1,\ldots, 6051\)) we measure three things, denoted as \((v_t, r_t, z_t)\).
All three series have high auto-correlation.
\(~\)
\(~\)
Aim:
\(~\)
But, how do we represent this problem in terms of Figure 10.12?
The idea is to extract shorter series up to a lag of length \(L\):
\(~\)
\[X_1 = \left( \begin{matrix} v_{t-L}\\ r_{t-L}\\ z_{t-L} \end{matrix} \right), \ \quad X_2 = \left( \begin{matrix} v_{t-L+1}\\ r_{t-L+1}\\ z_{t-L+1} \end{matrix} \right), ... , \quad X_L = \left( \begin{matrix} v_{t-1}\\ r_{t-1}\\ z_{t-1} \end{matrix} \right), \quad Y = v_t\]
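In R, the lagged design matrix for one of the three series can be built like this (toy series of our own; the real example would stack all three series for each lag):

```r
# Build lagged predictors: row t holds (v_{t-L}, ..., v_{t-1}), target v_t
v <- 1:20          # toy stand-in for the trading volume series
L <- 5             # number of lags
X <- t(sapply((L + 1):length(v), function(t) v[(t - L):(t - 1)]))
Y <- v[(L + 1):length(v)]
dim(X)    # 15 observations, 5 lagged columns
X[1, ]    # 1 2 3 4 5 -> predicts Y[1] = 6
```

Each of the \(n - L\) rows is one short input sequence \(X_1, \ldots, X_L\) in the sense of Figure 10.12, with the next value as its target.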
\(~\)
\(~\)
\(~\)
\(~\)
\(~\)
Salary for 263 baseball players.
\(~\)
We compare
\(~\)
Conclusions?
Lots of papers like these:
\(~\)
\(~\)
Very recent paper: